AI Consulting Toolkit

Architecture Decision Sheet

A comprehensive reference for selecting the right tools at each layer of an AI system, from MVP to production.

Covers 12 architecture layers. Each tool is tagged by phase (MVP, PROD, or BOTH) and by type (AZURE, OSS, SaaS).
🧠 AI Backbone: LLM Provider
Core intelligence layer: model selection, API provider, deployment model

💡 Decision principle: Choose your orchestration framework first, LLM provider second. Most frameworks are model-agnostic. For MVP, start with hosted APIs. For production, evaluate latency, cost per token, data sovereignty, and fine-tuning needs.

GPT-4o / GPT-4.1 [SaaS API] (BOTH)
OpenAI: multimodal (text/image/audio)
Why: Best-in-class reasoning, massive ecosystem, vision built in. Go-to for complex agent tasks, code generation, document analysis.
Tradeoffs: ⚠ Data leaves your infra. Cost scales with tokens. Rate limits on free tiers.
Complexity: LOW · Alternatives: Claude 3.5, Gemini 1.5 Pro

Claude 3.5 / Claude 4 [SaaS API] (BOTH)
Anthropic: 200k context, strong reasoning
Why: Best for long-document analysis, instruction following, and low hallucination. The 200k context window is unmatched for RAG with large documents.
Tradeoffs: ⚠ No self-hosting. Limited fine-tuning options.
Complexity: LOW · Alternatives: GPT-4o, Gemini

Azure OpenAI Service [Managed, AZURE] (BOTH)
GPT-4o on Azure infra: enterprise compliance
Why: Data stays in your Azure tenant. HIPAA/SOC 2 compliant. Required for enterprise/gov clients. PTU for guaranteed throughput in prod.
Tradeoffs: ⚠ Slower model updates than OpenAI direct. Needs Azure subscription setup.
Complexity: MED · Alternatives: OpenAI direct API

Llama 3 / Mistral [OSS] (PROD)
Self-hosted open-source models
Why: Zero per-token cost at scale. Full data sovereignty. Fine-tunable for domain-specific tasks. Deploy on your GPU infra or AKS.
Tradeoffs: ⚠ Needs GPU infra. Ops overhead. Smaller context. Weaker reasoning than GPT-4-class models.
Complexity: HIGH · Alternatives: Phi-3, Gemma 2

vLLM / Ollama [OSS Runtime] (PROD)
Self-hosted LLM inference server
Why: High-throughput batched inference for production. vLLM: best for multi-user APIs. Ollama: best for local dev and edge deployment.
Tradeoffs: ⚠ Infra management required. GPU costs.
Complexity: HIGH · Alternatives: TGI (Hugging Face), LMDeploy
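The cost-per-token criterion above is easy to quantify before committing to a provider. A minimal sketch of a monthly-cost model follows; the token volumes and per-1K prices are illustrative placeholders, not current vendor rates.

```python
# Rough monthly-cost model for comparing hosted LLM pricing tiers.
# All prices here are ILLUSTRATIVE placeholders, not real vendor rates.

def monthly_cost_usd(input_tokens_per_day: int,
                     output_tokens_per_day: int,
                     price_in_per_1k: float,
                     price_out_per_1k: float,
                     days: int = 30) -> float:
    """Estimate monthly spend from average daily token volume."""
    daily = (input_tokens_per_day / 1000) * price_in_per_1k \
          + (output_tokens_per_day / 1000) * price_out_per_1k
    return round(daily * days, 2)

# Example: 2M input + 500k output tokens/day at placeholder rates.
hosted = monthly_cost_usd(2_000_000, 500_000, 0.005, 0.015)
```

Running the same volumes against self-hosted GPU amortization costs is what usually decides the hosted-vs-OSS question in the Llama/vLLM rows above.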
⚙️ Orchestration Framework
The "nervous system": agent coordination, workflow, tool use, multi-agent patterns

💡 Decision principle: This is the FIRST architectural decision. It shapes everything else. LangGraph = stateful/cyclical agents. CrewAI = role-based teams. AutoGen = conversation-based multi-agent. Semantic Kernel = enterprise/.NET first.

LangGraph [OSS] (BOTH)
LangChain: stateful graph-based agent orchestration
Why: Best for complex, stateful agents with loops, branching, and human-in-the-loop. Built-in checkpointing, memory, streaming. The most production-ready OSS option.
Tradeoffs: ⚠ Steep learning curve. Verbose graph definition.
Complexity: HIGH · Alternatives: CrewAI, AutoGen

CrewAI [OSS] (MVP)
Role-based multi-agent crews
Why: Fastest path to a multi-agent MVP. Intuitive agent/task/crew abstraction. Great for sequential role-delegation pipelines. Demos well for clients.
Tradeoffs: ⚠ Less control over state. Less flexible for non-sequential flows.
Complexity: LOW · Alternatives: LangGraph, n8n

AutoGen v0.4 [OSS] (BOTH)
Microsoft: conversation-based multi-agent
Why: Best for coding agents, autonomous problem solving, and multi-agent debate patterns. The AgentChat API is clean. Strong Azure/Microsoft ecosystem alignment.
Tradeoffs: ⚠ Less structured workflow control than LangGraph.
Complexity: MED · Alternatives: LangGraph, CrewAI

Semantic Kernel [SDK, AZURE] (PROD)
Microsoft: enterprise SDK for AI orchestration
Why: Best choice if the client is a .NET/C# shop or a deep Azure tenant. Native Azure AI Foundry integration, enterprise memory patterns, plugin architecture.
Tradeoffs: ⚠ Python support lags .NET. Smaller community than LangChain.
Complexity: MED · Alternatives: LangGraph + Azure OpenAI

n8n [OSS] (MVP)
Low-code workflow automation with AI nodes
Why: Ideal for deterministic, operational automation (CRM sync, email triage, data pipelines). Non-dev stakeholders can edit flows. Fast POC delivery.
Tradeoffs: ⚠ Not suited for complex reasoning or autonomous agents. Not production-grade AI logic.
Complexity: LOW · Alternatives: Make.com, Zapier AI
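The "stateful/cyclical agents" pattern that LangGraph formalizes can be sketched in plain Python: nodes mutate a shared state dict, a router picks the next node, and every step is checkpointed so a run can resume or roll back. This is not LangGraph's API, just the shape of the pattern; the node names and acceptance rule are invented for illustration.

```python
# Plain-Python sketch of a stateful graph: draft -> review, loop until approved.
checkpoints = []

def draft(state):
    state["attempts"] += 1
    state["text"] = f"draft v{state['attempts']}"
    return state

def review(state):
    # Toy acceptance rule standing in for an LLM critique step.
    state["approved"] = state["attempts"] >= 2
    return state

def route(state):
    # Conditional edge: loop back to draft until the review passes.
    return "done" if state["approved"] else "draft"

def run(state):
    node = "draft"
    while node != "done":
        fn = {"draft": draft, "review": review}[node]
        state = fn(state)
        checkpoints.append(dict(state))  # snapshot per step, as a checkpointer would
        node = "review" if node == "draft" else route(state)
    return state

final = run({"attempts": 0, "approved": False, "text": ""})
```

The checkpoint list is what makes human-in-the-loop and rollback possible: any snapshot can be restored and the loop resumed from there.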
🔍 Vector Store / Semantic Search
Embedding storage, ANN search, retrieval backbone for RAG systems

💡 Decision principle: A vector DB handles semantic/fuzzy search; SQL handles exact/structured retrieval. The best RAG architectures use BOTH: vector search returns candidate IDs, then SQL resolves them to full structured records. Choose a vector DB based on data volume, filtering needs, and managed vs self-hosted preference.

ChromaDB [OSS] (MVP)
In-process or client-server vector DB
Why: Zero infrastructure to set up. Perfect for POC and MVP. Runs in-process in Python. Easy LangChain/LlamaIndex integration.
Tradeoffs: ⚠ Not production-grade at scale. Limited metadata filtering. No managed cloud offering.
Scale: LOW (<1M vectors) · Alternatives: FAISS, Qdrant

Qdrant [OSS] (BOTH)
Rust-based, high-performance vector DB
Why: The best OSS option for production. Rich payload filtering, sparse+dense hybrid search, cloud and self-hosted. Docker-ready, fast.
Tradeoffs: ⚠ Smaller ecosystem than Pinecone. Needs infra management if self-hosted.
Scale: MED (millions) · Alternatives: Weaviate, Pinecone

Pinecone [Managed SaaS] (PROD)
Fully managed, serverless vector DB
Why: Zero ops. Serverless pricing model. Strong enterprise support. Best for teams without MLOps capacity who need reliable prod vector search.
Tradeoffs: ⚠ Vendor lock-in. Data leaves your infra. Cost at scale. No SQL-style joins.
Scale: HIGH (billions) · Alternatives: Weaviate Cloud, Qdrant Cloud

Azure AI Search [Managed, AZURE] (PROD)
Cognitive Search + vector indexing on Azure
Why: Best choice for an Azure-native stack. Hybrid search (keyword + vector), integrated with Azure OpenAI, Cosmos DB, Blob Storage. Enterprise SLA.
Tradeoffs: ⚠ Azure lock-in. Higher cost than OSS. Slower feature velocity.
Scale: HIGH (enterprise) · Alternatives: Qdrant + AKS

Weaviate [OSS] (PROD)
GraphQL API, multimodal, hybrid search
Why: Best for multimodal (text + image) retrieval. Built-in vectorizer modules, GraphQL API, object-level permissions. Strong enterprise roadmap.
Tradeoffs: ⚠ Higher resource usage. GraphQL adds a learning curve.
Scale: HIGH · Alternatives: Qdrant, Pinecone

pgvector [OSS Extension] (MVP)
Postgres extension for vector search
Why: Keep everything in one DB (Postgres). No extra infra. Perfect when data volume is modest and you want SQL + vector in one query. Supabase includes it.
Tradeoffs: ⚠ Slower ANN at large scale than dedicated vector DBs. Not purpose-built.
Scale: LOW (<500k vectors) · Alternatives: ChromaDB, Qdrant
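The "vector search returns candidate IDs, SQL resolves full records" principle can be sketched end to end with only the standard library: a brute-force cosine search stands in for the vector DB, and SQLite stands in for the structured store. The documents, embeddings, and table schema are invented for illustration; a real system would swap the loop for Qdrant, pgvector, or similar.

```python
import math
import sqlite3

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "vector store": doc_id -> tiny 3-d stand-ins for real embeddings.
docs = {
    1: [1.0, 0.0, 0.0],
    2: [0.9, 0.1, 0.0],
    3: [0.0, 1.0, 0.0],
}

# Structured store holding the full records.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
db.executemany("INSERT INTO docs VALUES (?, ?, ?)", [
    (1, "Refund policy", "Refunds within 30 days."),
    (2, "Returns", "Return unused items."),
    (3, "Shipping", "Ships in 2 days."),
])

def retrieve(query_vec, k=2):
    """Vector search for candidate IDs, then SQL for the full rows."""
    ranked = sorted(docs, key=lambda i: cosine(query_vec, docs[i]), reverse=True)
    ids = ranked[:k]
    marks = ",".join("?" * len(ids))
    return db.execute(
        f"SELECT id, title, body FROM docs WHERE id IN ({marks})", ids
    ).fetchall()

hits = retrieve([1.0, 0.05, 0.0])
```

The two-step shape is the point: the vector layer only has to return IDs, so the structured layer keeps sole ownership of the authoritative records.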
🗄️ Database: Structured Storage
Relational, NoSQL, graph, and time-series storage for operational data

SQLite [OSS Embedded] (MVP)
Embedded, serverless relational DB
Why: Zero setup. File-based. Perfect for agent memory, chat history, and structured episodic memory stores in an MVP. Pairs well with ChromaDB.
Tradeoffs: ⚠ No concurrent writes. Not for multi-user prod.
Use-case fit: agent memory, local dev · Alternatives: PostgreSQL

PostgreSQL [OSS] (BOTH)
Gold-standard relational DB + pgvector
Why: The best all-round choice. ACID, JSON support, the pgvector extension, mature ecosystem. If in doubt, choose Postgres. Scales to most production workloads.
Tradeoffs: ⚠ Needs ops at scale. Not natively globally distributed.
Use-case fit: everything structured · Alternatives: MySQL, SQLite

Azure Cosmos DB [Managed, AZURE] (PROD)
Globally distributed NoSQL + vector search (preview)
Why: Best for globally distributed, multi-region Azure deployments. Multiple APIs (SQL, MongoDB, Cassandra). NoSQL for flexible schemas, and it now has vector search.
Tradeoffs: ⚠ Expensive. The RU pricing model is confusing. Azure lock-in.
Use-case fit: global chatbots, IoT, sessions · Alternatives: MongoDB Atlas, DynamoDB

MongoDB Atlas [Managed SaaS] (BOTH)
Managed document DB with vector search
Why: Great for flexible schemas (chat history, agent state, unstructured docs). Atlas Vector Search means no separate vector DB is needed at moderate scale.
Tradeoffs: ⚠ Not relational; joins are painful. Cost at scale.
Use-case fit: chat history, flexible records · Alternatives: Firestore, Cosmos DB

Neo4j [OSS] (PROD)
Graph DB: knowledge graphs, entity relations
Why: Best for relationship-heavy data: knowledge graphs, ontologies, recommendation engines. GraphRAG uses Neo4j as a graph memory store. Cypher query language.
Tradeoffs: ⚠ Niche use case. Steep Cypher learning curve. Higher ops overhead.
Use-case fit: GraphRAG, knowledge base · Alternatives: Amazon Neptune, TigerGraph

Redis / Upstash [OSS Cache] (PROD)
In-memory key-value + vector store
Why: Semantic cache for LLM responses (a major cost saver). Session storage, rate limiting, real-time pub/sub. Redis Stack adds vector search.
Tradeoffs: ⚠ Not a primary DB. Memory-bound. Persistence requires configuration.
Use-case fit: caching, sessions, rate limits · Alternatives: Memcached, DynamoDB
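The "SQLite for agent memory / chat history" fit above amounts to one append-only table of turns, queryable per session. A minimal sketch, with an invented schema and session names:

```python
import sqlite3

# One append-only log of chat turns; the schema is a sketch, not a standard.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE turns (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    session TEXT NOT NULL,
    role TEXT CHECK(role IN ('user', 'assistant')),
    content TEXT NOT NULL,
    ts DATETIME DEFAULT CURRENT_TIMESTAMP)""")

def log_turn(session, role, content):
    db.execute("INSERT INTO turns (session, role, content) VALUES (?, ?, ?)",
               (session, role, content))

def history(session, limit=20):
    """Last `limit` turns of a session, oldest-first for prompt assembly."""
    rows = db.execute(
        "SELECT role, content FROM turns WHERE session = ? ORDER BY id DESC LIMIT ?",
        (session, limit)).fetchall()
    return list(reversed(rows))

log_turn("s1", "user", "What is our refund window?")
log_turn("s1", "assistant", "30 days.")
log_turn("s2", "user", "unrelated session")
```

Swapping the `connect` target from `":memory:"` to a file path is the entire persistence story in an MVP, which is why this pairs so well with an in-process vector store like ChromaDB.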
🧩 Agent Memory Architecture
Short-term, long-term, episodic, semantic, and procedural memory for AI agents

💡 Decision principle: Memory evolves: V1 = static lookup → V2 = agentic retrieval → V3 = multi-source integration → V4 = background self-updating memory. Match complexity to actual need. Most MVPs need only V1-V2.

In-context Window [OSS] (MVP)
Pass full history in the system prompt
Memory type: working memory
Why: Zero implementation. Works for short sessions. Sufficient for most chatbot MVPs. Claude/GPT-4o 128k+ contexts make this viable for longer.
Tradeoffs: ⚠ Context saturation. Token cost scales linearly. No persistence.
Complexity: MINIMAL · Alternatives: summarization buffer

LangChain Memory Buffers [OSS] (MVP)
ConversationBufferMemory, SummaryMemory
Memory type: short-term + summary
Why: Easy plug-in memory for LangChain chains. SummaryMemory compresses old turns, solving the token-budget problem. Backed by any DB.
Tradeoffs: ⚠ The legacy memory classes are deprecated as of LangChain v0.3+; moving to LangGraph is the preferred path.
Complexity: LOW · Alternatives: LangGraph checkpointer

LangGraph Checkpointer [OSS] (BOTH)
Built-in state persistence for LangGraph agents
Memory type: working + episodic
Why: Native state snapshot per turn. Supports resume, rollback, and human-in-the-loop. Backends: SQLite (dev), PostgreSQL/Redis (prod). The most production-ready pattern.
Tradeoffs: ⚠ LangGraph-specific. Adds graph-definition overhead.
Complexity: MED · Alternatives: AutoGen state, custom DB

LangMem / LangGraph Store [OSS] (BOTH)
Long-term semantic memory SDK for LangGraph
Memory type: semantic + episodic
Why: A cognitive memory model (semantic, episodic, procedural) baked into LangGraph. Cross-thread memory persistence. The best structured approach to agent long-term memory.
Tradeoffs: ⚠ Relatively new. Docs still maturing. LangGraph dependency.
Complexity: MED · Alternatives: Letta/MemGPT, mem0

Letta / MemGPT [OSS] (PROD)
Paged memory OS for LLM agents
Memory type: full cognitive model
Why: The most advanced open-source agent memory system. Paged memory (core/archival/recall), self-editing memory, multi-agent support. Ideal for long-running personal AI agents.
Tradeoffs: ⚠ Complex setup. Opinionated architecture. Smaller community.
Complexity: HIGH · Alternatives: LangMem, mem0

mem0 [SaaS] (PROD)
Managed memory layer for AI apps
Memory type: semantic long-term memory
Why: A managed service with no infra to run. Automatically extracts, stores, and retrieves memories across conversations. Good for SaaS AI products needing user-level personalization.
Tradeoffs: ⚠ SaaS cost. Data leaves your infra. Early-stage product.
Complexity: LOW · Alternatives: Letta, LangMem
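The buffer-plus-summary pattern behind SummaryMemory is simple to sketch: keep the last N turns verbatim and fold evicted turns into a running summary. Here `summarize()` is a deliberate stand-in for an LLM summarization call, and the turn limit is an invented constant.

```python
# Keep the last MAX_TURNS turns verbatim; compress older ones into a summary.
MAX_TURNS = 4

def summarize(summary, evicted):
    # Placeholder for an LLM call that compresses evicted turns;
    # here it just concatenates their contents.
    return (summary + " " + " ".join(t["content"] for t in evicted)).strip()

def add_turn(memory, role, content):
    memory["turns"].append({"role": role, "content": content})
    if len(memory["turns"]) > MAX_TURNS:
        evicted = memory["turns"][:-MAX_TURNS]
        memory["turns"] = memory["turns"][-MAX_TURNS:]
        memory["summary"] = summarize(memory["summary"], evicted)
    return memory

mem = {"summary": "", "turns": []}
for i in range(6):
    add_turn(mem, "user", f"turn{i}")
```

The prompt is then assembled as summary + recent turns, which is how this pattern caps token cost while keeping older context recoverable.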
📦 Data Strategy: Ingestion & ETL
Document parsing, chunking, embedding, pipelines, and data connectors for RAG

LlamaIndex [OSS] (BOTH)
Data framework for LLMs: parsing, indexing, querying
Role: RAG framework
Why: The best dedicated RAG toolkit. 160+ data connectors, advanced chunking strategies, query engines, reranking. Complements LangGraph for data-heavy RAG apps.
Tradeoffs: ⚠ Can be complex to configure. Some overlap with LangChain.
Complexity: MED · Alternatives: LangChain loaders

Unstructured.io [OSS] (BOTH)
Document parsing: PDF, Word, HTML, images
Role: doc parser
Why: Best-in-class for extracting clean text from messy documents. Handles tables, headers, and images in PDFs. Open-source core plus a managed API for scale.
Tradeoffs: ⚠ Managed API costs. Complex docs need tuning.
Complexity: LOW · Alternatives: Azure Document Intelligence, Docling

Azure Document Intelligence [AZURE] (PROD)
OCR + layout analysis + form extraction
Role: doc parser
Why: Enterprise-grade structured document extraction (invoices, forms, contracts). Prebuilt models for common document types. Tight Azure ecosystem integration.
Tradeoffs: ⚠ Pay-per-page pricing. Azure lock-in.
Complexity: MED · Alternatives: Unstructured, Textract

OpenAI / Azure Embeddings [SaaS] (BOTH)
text-embedding-3-small/large
Role: embeddings
Why: State-of-the-art embedding quality. text-embedding-3-small offers the best cost/quality tradeoff. Critical: embed queries and documents with the SAME model.
Tradeoffs: ⚠ Per-token cost. Data leaves your infra. Changing models requires re-embedding.
Complexity: LOW · Alternatives: Cohere, BGE, E5

Apache Airflow / Prefect [OSS] (PROD)
Workflow orchestration for data pipelines
Role: pipeline orchestration
Why: Schedule and monitor embedding-refresh pipelines. Airflow for complex DAGs; Prefect for simpler Python-native flows. Essential for keeping the vector store fresh.
Tradeoffs: ⚠ Infra overhead. Overkill for simple scheduled jobs.
Complexity: HIGH · Alternatives: Azure Data Factory, Dagster
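The chunking step these tools automate has a simple baseline: fixed-size windows with overlap, so sentences split at a boundary still appear whole in at least one chunk. A minimal sketch (the frameworks above add sentence-aware and token-aware splitting on top of this idea):

```python
# Naive fixed-size chunker with overlap between consecutive chunks.
def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# Tiny example so the overlap is visible: neighbors share 2 characters.
pieces = chunk("abcdefghij", size=4, overlap=2)
```

Chunk size and overlap are tuning parameters: larger chunks preserve context but dilute relevance scores, while overlap costs extra embedding tokens in exchange for boundary robustness.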
🖥️ UI / Frontend
Chat interfaces, dashboards, admin panels, streaming UX

Streamlit [OSS] (MVP)
Python-native rapid UI for data apps
Why: Fastest time-to-demo for Python AI apps. Built-in chat components, file upload, streaming support. Perfect for internal tools and client POCs.
Tradeoffs: ⚠ Not production-grade. Limited customization. Not for customer-facing apps.
Complexity: LOW · Alternatives: Gradio, Chainlit

Chainlit [OSS] (MVP)
Chat UI framework built for LLM apps
Why: Purpose-built for chatbot UIs. Step-by-step agent-reasoning display, file attachments, streaming, auth. Best for quickly shipping a polished chat interface.
Tradeoffs: ⚠ Less flexible than a full React app. Python-only backend.
Complexity: LOW · Alternatives: Streamlit, Open WebUI

Next.js + Vercel AI SDK [OSS] (BOTH)
React framework with streaming AI hooks
Why: Best for customer-facing AI products. The Vercel AI SDK handles SSE streaming, useChat/useCompletion hooks, and model switching. Production-ready; polished UX is achievable.
Tradeoffs: ⚠ Requires frontend dev skills. More setup than Streamlit/Chainlit.
Complexity: MED · Alternatives: Remix, SvelteKit

React + FastAPI [OSS] (PROD)
Decoupled frontend + AI backend
Why: The most flexible production architecture. FastAPI serves WebSocket/SSE streaming from the Python agent; React consumes it. Full control, full customization.
Tradeoffs: ⚠ Most engineering effort. Needs both Python and JS/TS skills.
Complexity: HIGH · Alternatives: Next.js + FastAPI, Django

Open WebUI [OSS] (MVP)
Self-hosted ChatGPT-like interface
Why: An instant ChatGPT-like UI for any OpenAI-compatible API. Docker deploy. Supports multiple models, RAG, web browsing. Zero UI development needed.
Tradeoffs: ⚠ Hard to customize deeply. Better suited for internal tools.
Complexity: MINIMAL · Alternatives: Chatbot UI, LibreChat
⚡ Backend / API Layer
REST/WebSocket API servers, streaming, auth, rate limiting

FastAPI [OSS] (BOTH)
Modern async Python REST + WebSocket server
Why: The default choice for Python AI backends. Async-native for streaming LLM responses. Auto-generated OpenAPI docs. SSE support built in. Uvicorn/Gunicorn for prod.
Tradeoffs: ⚠ The Python GIL limits true parallelism; you need a separate worker-scaling strategy.
Complexity: LOW · Alternatives: Django REST, Flask

Azure Functions [AZURE] (PROD)
Serverless compute for event-driven AI logic
Why: Best for event-driven tasks: webhook handlers, real-time voice pipeline steps, scheduled embedding refresh. Pay per execution. Tight Azure integration.
Tradeoffs: ⚠ Cold-start latency. Execution time limits. Azure vendor lock-in.
Complexity: MED · Alternatives: AWS Lambda, Google Cloud Run

Azure API Management [AZURE] (PROD)
API gateway with rate limiting, auth, throttling
Why: Enterprise API gateway: per-user/key rate limiting, token quota management, load balancing across Azure OpenAI PTU pools, auth, analytics. Essential for multi-tenant AI APIs.
Tradeoffs: ⚠ Complex setup. Azure-only. Licensing cost.
Complexity: HIGH · Alternatives: Kong, AWS API Gateway
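The SSE streaming mentioned in the FastAPI row has a fixed wire format: each chunk is a `data:` line followed by a blank line, conventionally ended with a `[DONE]` sentinel (the OpenAI streaming convention). A minimal framing sketch, with the `delta` payload key chosen for illustration:

```python
import json

def sse_stream(chunks):
    """Wrap text chunks in server-sent-events framing."""
    for chunk in chunks:
        yield f"data: {json.dumps({'delta': chunk})}\n\n"
    yield "data: [DONE]\n\n"

# In FastAPI this generator would be wrapped in a StreamingResponse
# with media_type="text/event-stream"; here we just materialize it.
events = list(sse_stream(["Hel", "lo"]))
```

Because each event is self-delimited by the blank line, the browser's `EventSource` (or the Vercel AI SDK's hooks) can render tokens as they arrive instead of waiting for the full completion.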
🚀 Containerization & Deployment
Docker, orchestration, CI/CD, cloud deployment targets

Docker + Compose [OSS] (BOTH)
Container runtime + local multi-service orchestration
Why: The universal standard. Compose for local dev with multiple services (API + vector DB + Redis). The same image promotes from dev to prod. No team should ship without this.
Tradeoffs: ⚠ Not for production orchestration at scale; use K8s or managed containers.
Complexity: LOW · Alternatives: Podman

Azure Container Apps [AZURE] (PROD)
Serverless K8s-based container hosting
Why: The best managed container platform on Azure. Auto-scaling to zero, KEDA-based event scaling, Dapr integration. Much simpler than AKS for most AI app deployments.
Tradeoffs: ⚠ Less control than AKS. Not for stateful workloads or GPU inference.
Complexity: MED · Alternatives: Azure App Service, AKS

AKS (Azure Kubernetes Service) [AZURE] (PROD)
Managed Kubernetes: full control
Why: Required for GPU workloads (self-hosted LLMs), complex microservice AI architectures, custom networking/security requirements, or high-throughput production AI APIs.
Tradeoffs: ⚠ High ops complexity. Requires K8s expertise. Cost.
Complexity: HIGH · Alternatives: GKE, EKS

Railway / Render [SaaS PaaS] (MVP)
Simple PaaS for containerized apps
Why: Deploy FastAPI + Postgres + Redis in minutes. No Kubernetes. Git-push deploys. Best for rapid MVP delivery when infra is not the focus.
Tradeoffs: ⚠ Not enterprise-grade. Limited compliance controls. Vendor dependency.
Complexity: LOW · Alternatives: Fly.io, Heroku

GitHub Actions [SaaS] (BOTH)
CI/CD pipeline automation
Why: The standard CI/CD for most projects: build → test → push Docker image → deploy to a container platform. Free for public repos; generous free tier for private ones.
Tradeoffs: ⚠ Complex pipelines become unwieldy. Azure DevOps is better for deep Azure integration.
Complexity: LOW · Alternatives: Azure DevOps, GitLab CI
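The "API + vector DB + Redis" local stack from the Docker + Compose row can be sketched as a Compose file. Service names, ports, and the build context are assumptions for illustration; only the `qdrant/qdrant` and `redis:7` images are published upstream defaults.

```yaml
# Hypothetical local dev stack: FastAPI app + Qdrant + Redis.
services:
  api:
    build: .                  # assumes a Dockerfile for the FastAPI app
    ports:
      - "8000:8000"
    environment:
      QDRANT_URL: http://qdrant:6333
      REDIS_URL: redis://redis:6379
    depends_on:
      - qdrant
      - redis
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"
  redis:
    image: redis:7
```

Compose's service names double as DNS hostnames on the shared network, which is why the API reaches its dependencies as `qdrant` and `redis` rather than `localhost`.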
📊 Logging, Monitoring & Observability
LLM tracing, cost tracking, latency monitoring, error alerting

💡 Decision principle: AI observability ≠ traditional APM. You need LLM-specific tracing (prompt/response capture, token cost per run, chain-step visibility). Add LangSmith or LangFuse early; you'll regret not having it during production debugging sessions.

LangSmith [SaaS] (BOTH)
LangChain's LLM observability platform
Focus: LLM tracing
Why: Best in class for LangChain/LangGraph apps. Auto-traces every chain/agent step, showing prompt/response, token cost, and latency per node. Essential for debugging agent loops.
Tradeoffs: ⚠ SaaS cost at scale. LangChain ecosystem only (though the SDK is broader).
Complexity: LOW · Alternatives: LangFuse, Arize

LangFuse [OSS] (BOTH)
Open-source LLM observability, self-hostable
Focus: LLM tracing
Why: Framework-agnostic LLM observability. Self-hostable (data sovereignty). Covers traces, evals, datasets, prompt management. The best OSS alternative to LangSmith.
Tradeoffs: ⚠ Self-hosting adds ops. Smaller community than LangSmith.
Complexity: MED · Alternatives: LangSmith, Helicone

Azure Monitor + App Insights [AZURE] (PROD)
Full-stack Azure observability platform
Focus: platform monitoring
Why: Unified logs, metrics, and traces for Azure-hosted apps. KQL queries for log analysis. Custom dashboards, alerting, distributed tracing. Required for enterprise Azure deployments.
Tradeoffs: ⚠ Not LLM-specific. KQL learning curve. Cost scales with data volume.
Complexity: MED · Alternatives: Datadog, Grafana stack

Prometheus + Grafana [OSS] (PROD)
Metrics collection + visualization stack
Focus: infra metrics
Why: The standard OSS metrics stack. Instrument FastAPI/vLLM with Prometheus exporters. Grafana dashboards for throughput, latency, token rates, queue depths, GPU utilization.
Tradeoffs: ⚠ Infra overhead. Not LLM-aware out of the box; requires custom metrics.
Complexity: HIGH · Alternatives: Datadog, Azure Monitor

Helicone [SaaS] (PROD)
LLM cost + usage analytics proxy
Focus: cost tracking
Why: A drop-in proxy for OpenAI/Anthropic APIs. Tracks cost per user/session, caches responses, rate-limits. Excellent for multi-tenant SaaS AI products where per-user cost matters.
Tradeoffs: ⚠ Sits in the request path, adding latency. SaaS dependency.
Complexity: LOW · Alternatives: LangFuse, OpenLLMetry
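The per-user cost accounting a proxy like Helicone provides reduces to a small amount of bookkeeping: accumulate token usage per user key and price it. A sketch with placeholder rates (not real vendor pricing) and an invented user key:

```python
from collections import defaultdict

# Illustrative per-1K-token rates; NOT real vendor pricing.
PRICE_PER_1K = {"input": 0.005, "output": 0.015}

usage = defaultdict(lambda: {"input": 0, "output": 0})

def record(user: str, input_tokens: int, output_tokens: int):
    """Accumulate token usage per user, as a cost-tracking proxy would."""
    usage[user]["input"] += input_tokens
    usage[user]["output"] += output_tokens

def cost(user: str) -> float:
    """Price a user's accumulated usage."""
    u = usage[user]
    return round(u["input"] / 1000 * PRICE_PER_1K["input"]
                 + u["output"] / 1000 * PRICE_PER_1K["output"], 4)

record("alice", 1200, 300)
record("alice", 800, 200)
```

In a multi-tenant product this same ledger is what drives per-user quotas and billing alerts, which is why it belongs in the request path from day one.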
🎙️ Voice AI Stack
STT, TTS, real-time voice pipelines, telephony integration

💡 Decision principle: Target under 500ms perceived end-to-end latency. Real-time voice means WebSocket/WebRTC throughout (no HTTP polling), and the STT → LLM → TTS pipeline must support streaming at every stage. For telephony: ACS Call Automation on Azure, Twilio elsewhere.

Whisper / Azure Speech STT [OSS] (BOTH)
Speech-to-text transcription
Layer: STT
Why: Whisper (OSS): best accuracy, 100+ languages, self-hostable. Azure Speech: managed, streaming, real-time, enterprise SLA. Use Azure for production voice pipelines on the Azure stack.
Tradeoffs: ⚠ Whisper is not real-time by default. Azure costs per audio hour.
Complexity: MED · Alternatives: Deepgram, AssemblyAI

Deepgram [SaaS] (BOTH)
Real-time STT with ultra-low latency
Layer: STT (real-time)
Why: Best in class for real-time streaming STT. ~300ms latency. WebSocket API. Significantly faster than Azure Speech for live voice-agent use cases.
Tradeoffs: ⚠ SaaS cost. Data leaves your infra. Per-minute pricing.
Complexity: LOW · Alternatives: Azure Speech, AssemblyAI

ElevenLabs / Azure TTS [SaaS] (BOTH)
Text-to-speech synthesis
Layer: TTS
Why: ElevenLabs: the most natural voice quality, streaming TTS, voice cloning. Azure TTS: enterprise-grade, 400+ voices, Azure integration, Neural TTS. Choose based on naturalness vs compliance needs.
Tradeoffs: ⚠ Per-character cost. Voice cloning raises ethical/legal issues.
Complexity: LOW · Alternatives: OpenAI TTS, PlayHT

Azure ACS + Call Automation [AZURE] (PROD)
Telephony + real-time voice pipeline on Azure
Layer: telephony
Why: Enterprise telephony integration (PSTN, SIP). The Call Automation API enables programmatic call control, real-time transcription, and media streaming to Azure Functions. ART Accelerator pattern.
Tradeoffs: ⚠ Complex setup (ACS + Functions + Event Grid architecture). Azure-only.
Complexity: HIGH · Alternatives: Twilio, Vonage

OpenAI Realtime API [SaaS] (BOTH)
End-to-end real-time voice (GPT-4o)
Layer: full voice pipeline
Why: A single WebSocket API covering STT + LLM + TTS in one round trip. Dramatically simplifies voice architecture. Voice activity detection included. Best for MVP voice agents.
Tradeoffs: ⚠ Expensive. Less control over individual pipeline stages. OpenAI lock-in.
Complexity: LOW · Alternatives: LiveKit, Daily.co + custom
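The 500ms budget above is worth checking stage by stage before committing to a vendor mix. A back-of-envelope sketch; the per-stage numbers are illustrative assumptions, not vendor benchmarks:

```python
# End-to-end voice latency is roughly the sum of time-to-first-output
# at each streaming stage: STT final result, LLM first token, TTS first audio.
BUDGET_MS = 500

def pipeline_latency(stages: dict[str, int]) -> int:
    """Total perceived latency as the sum of per-stage contributions."""
    return sum(stages.values())

stages = {
    "stt_final_ms": 150,        # e.g. a streaming STT provider
    "llm_first_token_ms": 200,  # time to first token, not full completion
    "tts_first_audio_ms": 120,  # time to first audio chunk
}
total = pipeline_latency(stages)
within_budget = total <= BUDGET_MS
```

The key modeling choice is using time-to-first-output per stage: with streaming end to end, later tokens and audio overlap with playback, so only the first-chunk latencies stack.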
🔐 Security, Auth & Guardrails
Identity, access control, prompt-injection protection, output guardrails

Auth0 / Azure AD B2C [SaaS] (BOTH)
Identity and access management
Layer: auth/identity
Why: Auth0: fastest MVP auth on any stack, social logins, MFA. Azure AD B2C: enterprise identity for Azure-native apps, SAML/OIDC, conditional access. Don't build auth from scratch.
Tradeoffs: ⚠ Auth0 cost at scale. B2C has complex configuration. Vendor dependency.
Complexity: LOW · Alternatives: Clerk, Supabase Auth

Azure Key Vault [AZURE] (BOTH)
Secrets, keys, and certificate management
Layer: secrets management
Why: Never store API keys in code or env files for client deployments. Key Vault + Managed Identity gives a zero-credential access pattern. Required for enterprise Azure deployments.
Tradeoffs: ⚠ Azure-specific. Adds latency if not cached.
Complexity: LOW · Alternatives: HashiCorp Vault, AWS Secrets Manager

Guardrails AI / NeMo Guardrails [OSS] (PROD)
Output validation and prompt safety rails
Layer: LLM safety
Why: Validate LLM outputs against schemas, with PII detection, topic restrictions, and hallucination checks. NeMo Guardrails is NVIDIA's rail framework, using the Colang language for policy definition.
Tradeoffs: ⚠ Adds latency per call. Config overhead. False positives on edge cases.
Complexity: MED · Alternatives: Azure Content Safety, Rebuff

Azure Content Safety [AZURE] (PROD)
Harmful-content detection API
Layer: content moderation
Why: A managed API for detecting hate speech, violence, and sexual content in both inputs and outputs. Prompt Shields for jailbreak/prompt-injection detection. Required for public-facing Azure AI apps.
Tradeoffs: ⚠ Per-call cost. Azure lock-in. Added latency.
Complexity: LOW · Alternatives: Guardrails AI, OpenAI moderation
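The PII-detection rail that Guardrails AI and NeMo Guardrails formalize can be illustrated as a check that runs on every response before it reaches the user. The two regexes below are toy examples and far from production-complete; real rails combine pattern matching with schema validation and model-based classifiers.

```python
import re

# Illustrative PII patterns only; production rails need far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def check_output(text: str) -> list[str]:
    """Return the PII rule names the text violates (empty list = pass)."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(text)]

ok = check_output("Your ticket has been escalated.")
bad = check_output("Contact john.doe@example.com or SSN 123-45-6789.")
```

A failing check typically triggers redaction or a regeneration loop rather than a hard error, which is where the per-call latency cost noted above comes from.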